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Abstract 

Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing 
DNA directly from an environmental sample. One of the main challenges in metagenomics data 
analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Due to the 
large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can 
operate with reasonable computing requirements. While standard alignment-based methods provide 
state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read 
based on the fc-mers it contains have the potential to provide faster solutions. 

In this work, we investigate the potential of modern, large-scale machine learning implementa¬ 
tions for taxonomic affectation of next-generation sequencing reads based on their fc-mers profile. 
We show that machine learning-based compositional approaches benefit from increasing the number 
of fragments sampled from reference genome to tune their parameters, up to a coverage of about 10, 
and from increasing the fc-mer size to about 12. Tuning these models involves training a machine 
learning model on about 10® samples in 10^ dimensions, which is out of reach of standard soft¬ 
wares but can be done efficiently with modern implementations for large-scale machine learning. 
The resulting models are competitive in terms of accuracy with well-established alignment tools for 
problems involving a small to moderate number of candidate species, and for reasonable amounts of 
sequencing errors. We show, however, that compositional approaches are still limited in their abil¬ 
ity to deal with problems involving a greater number of species, and more sensitive to sequencing 
errors. We finally confirm that compositional approach achieve faster prediction times, with a gain 
of 3 to 15 times with respect to the BWA-MEM short read mapper, depending on the number of 
candidate species and the level of sequencing noise. 


1 Introduction 


Recent progress in next-generation sequencing (NGS) technologies allow to access large amounts of 
genomic data within a few hours at a reasonable cost ( [Soon et ^ [2013 1. In metagenomics, NGS is 
used to analyse the genomic content of microbial communities by sequencing all DNA present in an 
environmental sample ( |Riesenfeld et al.\ 2004 1 . It gives access to all organisms present in the sample 
even if they do not grow on culture media (Hugenholtz et al. 20021, and allows us to characterize with 
an unprecedented level of resolution the diversity of the microbial realm ( Peterson et al.\ 20091. 

The raw output of a metagenomics experiment is a large set of short DNA sequences (reads) obtained 
by high-throughput sequencing of the DNA present in the sample. There exist two main approaches to 
analyze these data, corresponding to slightly different goals. On the one hand, taxonomic profiling aims 
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to estimate the relative abundance of the members of the microbial community, without necessarily af¬ 
fecting each read to a taxonomic class. Recent works like WGSQuikr (Koslicki et al. 20141 or GASIC 
( Lindner! 2012| ) proved to be very efficient for this purpose. Taxonomic binning methods, on the other 
hand, explicitly affect each read to a taxonomic clade. This process can be unsupervised, relying on clus¬ 
tering methods to affect reads to operational taxonomic units (OTU), or supervised, in which case reads 
are individually affected to nodes of the taxonomy ( Mande gta/.[|2012 l. While binning is arguably more 
challenging that profiling, it is a necessary step for downstream applications which require draft-genome 
reconstruction. This may notably be the case in a diagnostics context, where further analyses could aim 
to detect pathogen micro-organisms ( Miller g? a/.[|2013 l or antibiotic resistance mechanisms (Schmieder 
\et a/.[[2012| ). 

In this paper we focus on the problem of supervised taxonomic binning, where we wish to as¬ 
sign each read in a metagenomics sample to a node of a pre-defined taxonomy. Two main computa¬ 
tional strategies have been proposed for that purpose: (i) alignment-based approaches, where the read 
is searched against a reference sequence database with sequence alignment tools like BLAST (|Huson| 


et al. 20071 or short read mapping tools (e.g., BWA, Li et al. 20091, and (ii) compositional approaches. 


where a machine learning model such as a naive Bayes (NB) classifier (Wang et al.\ 2007 1 Parks et al] 


20111 or a supporf vector machine (SVM, McHardy et al.\ 2007 Pafil et aL\ 20121 is framed to label fhe 


read based on fhe sef of A;-mers if confains. Since fhe faxonomic classificafion of a sequence by compo¬ 
sitional approaches is only based on fhe sef of fc-mers if confains, fhey can offer significanf gain in ferms 
of classificafion fime over similarify-based approaches. Training a machine learning model for faxo¬ 
nomic binning can however be compufafionally challenging. Indeed, composifional approaches musf be 
frained on a sef of sequences wifh known faxonomic labels, fypically obfained by sampling error-free 
fragmenfs from reference genomes. In fhe case of NB classifiers, explicif sampling of fragmenfs from 
reference genomes is nof needed fo frain fhe model: insfead, a global profile of fe-mer abundance from 
each reference genome is sufficienf fo esfimafe fhe paramefers of fhe NB model, leading fo simple and 
fasf implemenfafions (Wang et al. 2007 1 Rosen g? aH|201l| Parks et aH|2011 1. On fhe ofher hand, in 
fhe case of SVM and relafed discriminative mefhods, an explicif sampling of fragmenfs from reference 
genomes fo frain fhe model based on fhe A;-mer confenf of each fragmenf is needed, which can be a 
limifafion for sfandard SVM implemenfafions. For example, Pafil et al. ( |2012| sampled approximafely 
10,000 fragmenfs from 1768 genomes fo frain a sfrucfured SVM (based on a fc-mer represenfafion wifh 
k = 4,5,6), and reporfed an accuracy competitive wifh similarify-based approaches. Increasing fhe 
number of fragmenfs sampled fo frain a SVM may improve ifs accuracy, and allow us fo invesfigafe 
larger values of k. However if also raises compufafional challenges, as if involves machine learning 
problems where a model musf be frained from pofenfially millions or billions of framing examples, each 
represented by a vector in 10^ dimensions for, e.g., k = 12. 

In fhis work, we invesfigafe fhe pofenfial of composifional approaches for faxonomic label assign- 
menf using modem, large-scale machine learning algorifhms. 


2 Methods 

2.1 Linear models for read classification 

In mosf of compositional mefagenomics applicafions, a sequence is represenfed by ifs /c-mer profile, 
namely, a vecfor counfing fhe number of occurrences of any possible word of k letters in fhe sequence. 
Only fhe A, T, C, G nucleofides are usually considered fo define k-m&v profiles, fhaf are fherefore 4^- 
dimensional vecfors. Alfhough fhe size of fhe fc-mer profile of a sequence of lengfh I increases expo- 
nenfially wifh k, if confains af mosf I — k + 1 non-zero elemenfs since a sequence of lengfh k confains 
I — k + \ differenf /c-mers. 

Given a sequence represenfed by ifs k-mer profile x G M** , we consider linear models fo assign if fo 
one of K chosen faxonomic classes. A linear model is a sef of weighf vecfors wi,..., wk G fhaf 


2 












































assign x to the class 


are max w- x, 
j=h-,K ^ 

where w~^x is the standard inner product between vectors. To train the linear model, we start from a 
training set of sequences xi,..., G with known taxonomic labels ci,..., G {1,..., K}. 
A NB classifier, for example, is a linear model where the weights are estimated from the /c-mer count 
distributions on each class. Another class of linear models popular in machine learning, which include 
S VM, are the discriminative approaches that learn the weights by solving an optimization problem which 
aims to separate the training data of each class from each other. More precisely, to optimize the weight 
Wj of the j-th class, one typically assigns a binary label yi to each training example (y* = 1 if c, = j, or 
yi = —1 otherwise) and solves an optimization problem of the form 


1 ” 
min — > 
n 

2 = 1 


i{yi,w~^Xi) + Alim] 


( 1 ) 


where £{y, t) is a loss function quantifying how "good" the prediction t is if the true label is y, and A > 0 
is a regularization parameter to tune, helpful to prevent overfitting in high dimension. A SVM solves 
Q with the hinge loss i{y,t) = max{0 ,1 — yt), but other losses such as the logistic loss i{y,t) = 
log(l + exp(—yt)) or the squared loss i{y,t) = (y — t)^ are also possible and often lead to models 
with similar accuracies. These models have met significant success in numerous real-world learning 


tasks, including compositional metagenomics ( Patil et aL\ 20121. In this work, we use the squared loss 
function and choose A = 0, leading to no regularization. 


2.2 Large-scale learning of linear models 

Although learning linear models by solving ([T]l is now a mature technology implemented in numerous 
softwares, metagenomics applications raise computational challenges for most standard implementa¬ 
tions, due to the large values that n (number of reads in the training set), p = 4^ (dimension of the 
models) and K (number of taxonomic classes) can take. 

The training set is typically obtained by sampling fragments from reference genomes with known 
taxonomic class. For example, Patil et al. ( 2012| l sampled approximately n = 10,000 fragments from 
1, 768 genomes to train SVM models based on fe-mer profiles of size A: = 4, 5, 6. However, fhe number 
of disfincf fragmenfs fhaf may be drawn from a genome sequence is approximafely equal fo ifs lengfh (by 
sampling a fragmenf sfarfing al each position in fhe genome), hence can reach several millions for each 
microbial genome, leading lo pofenfially billions of Iraining sequences when Ihousands of reference 
genomes are used. While considering every possible fragmenf from every possible genome may nol be 
fhe besl choice because of fhe possible redundancy belween fhe reads, if may still be useful lo consider 
a significanl number of fragmenfs lo properly accounl for fhe infra and infer species genomic variabilily. 
Similarly, exploring models wilh k larger lhan 6, say 10 or 15, may be inleresling bul requires (i) fhe 
capacity lo manipulale fhe corresponding 4^-dimensional vecfors (4^® ~ 10®), and (ii) large Iraining 
sefs since many examples are needed fo learn a model in high dimension. Finally, real-life applicafions 
involving aclual environmenlal samples may confain several hundreds microbial species, casting fhe 
problem info a relafively massive multiclass scenario oul of reach of mosl slandard implemenlalions of 
SVM. 

To solve Q efficienlly when n, k and K fake large values, we use a dedicafed implemenlafion of 
slochaslic gradienl descenl (SGD Bollou 19981 available in fhe Vowpal Wabbil soflware (VW Langford] 
et al. 2007 1 Agarwal et al. 20141. In shorl, SGD exploils fhe facl fhaf fhe objective function in Q is 


an average of n ferms, one for each Iraining example, lo approximafe fhe gradienl al each sfep using 
a single, randomly chosen ferm. Alfhough SGD requires more steps lo converge lo fhe solution lhan 
slandard gradienl descenl, each step is n times faster and fhe melhod is overall faster and more scalable. 
In addition, allhough Ihe dimension y = 4*^ of Ihe dala is large, VW exploils Ihe facl lhal each Iraining 
example is sparse, leading lo efficienl memory storage and fasl updates al each SGD step. We refer 
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Acetobacter pasteurianus 
Acinetobacter baumannii 
Bacillus amyloliquefaciens 
Bacillus anthracis 
Bacillus subtilis 
Bacillus thuringiensis 
Bifidobacterium bifidum 
Bifidobacterium longum 
Borrelia burgdorferi 
Brucella abortus 
Brucella melitensis 
Buchnera aphidicola 
Burkholderia mallei 
Burkholderia pseudomallei 
Campylobacter jejuni 
Corynebacterium pseudotuberculosis 
Corynebacterium ulcerans 
Coxiella burnetii 
Desulfovibrio vulgaris 
Enterobacter cloacae 
Escherichia coli 
Erancisella tularensis 
Helicobacter pylori 
Legionella pneumophila 
Leptospira interrogans 
Listeria monocytogenes 


Methylobacterium extorquens 
Mycobacterium bovis 
Mycobacterium tuberculosis 
Mycoplasma fermentans 
Mycoplasma genitalium 
Mycoplasma mycoides 
Mycoplasma pneumoniae 
Neisseria gonorrhoeae 
Propionibacterium acnes 
Pseudomonas aeruginosa 
Pseudomonas stutzeri 
Ralstonia solanacearum 
Rickettsia rickettsii 
Shigella flexneri 
Staphylococcus aureus 
Streptococcus agalactiae 
Streptococcus equi 
Streptococcus mutans 
Streptococcus pneumoniae 
Streptococcus thermophilus 
Thermus thermophilus 
Treponema pallidum 
Yersinia enterocolitica 
Yersinia pestis 
Yersinia pseudotuberculosis 


Table 1: List of the 51 microbial species in the mini reference database. 


the interested reader to Bottou (20101 for more discussion about the relevance of SGD in large-scale 
learning. In practice, VW can train a model with virtually no limit on n as long as the data can be stored 
on a disk (they are not loaded in memory). As for k, VW can handle up to 2^^ distinct features, and 
the count of each k-m&r is randomly mapped to one feature by a hash table. This means that we have 
virtually no limit on k, except that when k approaches or exceeds the limit (such that 4^ = 2^^, i.e., 
k = 16), collisions will appear in the hash table and different A:-mers will be counted together, which 
may impact the performance of the model. 


3 Data 

We simulate metagenomics samples by generating reads from three different reference databases, which 
we refer to below as the mini, the small and the large databases. 

The mini reference database contains 356 complete genome sequences covering 51 bacterial species, 
listed in Table [T] We use this database to train and extensively vary the parameters of the different 
models. To measure the performance of the different models, we generate new fragments from 52 
genomes not present in the reference database, but originating from one of the 51 specie^ 

The small and large databases are meant to represent more realistic situations, involving a larger 
number of candidate bacterial species and a larger number of reference genomes. To define the refer¬ 
ence and validation databases that will respectively be used to build and evaluate the predictive models, 

*Two genomes are indeed available for the Francisella tularensis species, one of which originating from the novicida 
subspecies. 
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we first downloaded the 5201 complete bacterial and archeal genomes available in the NCBI RefSeq 
database as of July 2014 ([Pruitt et al.\ |2012|, by means of a functionality embedded in the Fragment 


Classification Package (FCP) (Parks etal] 20111. We then filtered these sequences according to a cri¬ 


terion proposed in Parks et al. (20111: we only kept genomes that belong to genera represented by at 
least 3 species. We also removed genomes represented by less than 10® nucleotides in order to filter 
draft genome sequences, plasmids, phages, contigs and other short sequences. The 2961 remaining se¬ 
quences originate from 774 species, among which 193 are represented by at least 2 strains. We split the 
sequences of these 193 species into two parts. We randomly pick one strain within each of these 193 
species to define a validation database, that will be used to estimate classification performance, through 
the sampling of genomic fragments or the simulation of sequencing reads. The remaining sequences of 
these 193 species define a first reference database, referred to as small below. In addition we define a 
larger reference database by adding to the small database described above the genomes originating from 
the 774 — 193 = 581 species represented by a single genome. The larger database, referred to as large 
below, therefore involves the 11A species available after filtering the NCBI database, and not solely the 
193 represented in the validation database. 


4 Results 


4.1 Proof of concept on the mini database 

In this section we present a study on the mini dataset, aiming to evaluate the impact of increasing the 
number of fragments used to train the model as well as the length of the /c-mers considered. For that 
purpose, we learn several classification models based on fragments of length L = 200 or L = 400 
sampled from the 356 reference genomes in the mini reference database, represented by A:-mers of size 
in {4, 6, 8,10,12}. The number of fragments used to learn the models is gradually increased by drawing 
several "batches" of fragments in order to cover, on average, each nucleotide of the reference genomes 
a pre-defined number of times c. We vary the coverage c between 0.1 to its maximal value, equal to 
the length of the fragments considered. This leads to learning models from around n = 2.7 X 10®, for 
c = 0.1 and L = 400, up to around n = 1.1 x 10® fragments, when c reaches its maximal value. This is 
way beyond the configurations considered for instance in jPatil et al. (2012 1 , where SVM models were 
learned from approximately 10^ fragments drawn from 1, 768 genomes. 

To assess the performance of these models we consider two sets of 134, 319 fragments, of respective 
length 200 and 400, drawn from the 52 complete genomes that are not in the reference database used 
to train the models. Performance is measured by first computing, for each species, the proportion of 
fragments that are correctly classified, and considering its median value across species. In a multiclass 
setting, this indicator is indeed less biased towards over-represented classes than the global rate of correct 
classification. 

Figure [T] shows the performance reached by models based on fragments of length 200 (left) or 400 
(right), for different values of k (horizontal axis) and different coverages (different colors). We first 
note that for c = 0.1, that is, for a limited number of fragments, the classification performance starts 
by increasing with the size of the A;-mers (up to fc = 8 and A: = 10 for fragments of length 200 and 
400, respectively), and subsequently decreases. This suggests that the number of fragments considered 
in this setting is not sufficient to efficiently learn when the dimensionality of the feature space becomes 
too large. Note that twice as many fragments of length 200 as fragments of size 400 are drawn for a 
given coverage value, which may explain why performance still increases beyond k = 8 with smaller 
fragments. Increasing the number of fragments confirms this hypothesis : performance systematically 
increases or remains steady with fe for c > 1, and for k > 8, the performance is significantly higher than 
that obtained at c = 0.1, for both length of fragments. Increasing the coverage from c = 1 to c = 10 has 
a positive impact in both cases, although marginally for fragments of length 400. Further increasing the 
number of fragments does not bring any improvement. 

Altogether, the optimal configuration on this mini dataset involves /c-mers of size 12 and drawing 
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Figure 1: Increasing the number of fragments and the /c-mer size on the mini datasets. Left: 
L = 200bp fragments. Right: L = 400bp fragments. These figures show the median species-level 
accuracy obtained by linear predictors trained with Vowpal Wabbit from fragments covering each refer¬ 
ence genome with a mean coverage c from 0.1 (red) to L (gold). Performances are reported as a function 
of /c-mer sizes. 


fragments at a coverage c > 10 for the two lengths of fragments considered. Further increasing the size 
of the /c-mers did not bring improvements, and actually proved to be challenging. Indeed, as mentioned 
above, VW proceeds by hashing the input features into a vector offering at most 2^^ entries. This hashing 
operation can induce collisions between features, which can be detrimental to the model if the number of 
features becomes too high with respect to the size of the hash table. This issue is even more stressed in a 
multiclass setting, where the number of hash table entries available per model is divided by the number 
of classes considered. On this dataset, 51 models have to be stored in the hash table, which reduces the 
number of entries available per model to 2^^/51 ~ = 4^^. We have empirically observed that 

performance could not increase for k greater than 12 and actually decreased for /c-mers greater than 15. 

We now compare these results to two well-established approaches: a comparative approach based 
on the BWA-MEM sequence aligner (Li 20131 and a compositional approach based on the generative 
NB classifier (Rosen et al. 2011|). The NB experiments rely on the FCP implementation (Parks et al. 


20111 and are carried out in the same setting as VW: we compute profiles of /c-mers abundance for 


the 356 genomes of the reference database, and use them to affect test fragments to their most likely 
genome. BWA-MEM is configured fo solely refurn hifs wifh maximal score (opfion -T 0). Unmapped 
fragmenls are counted as misclassificalions, and a single hif is randomly picked in case of multiple hifs, 
in order fo obfain a species-level predicfion. This laffer random hif selection process is repealed 20 
times and Ihe performance indicalor reported below corresponds fo ils median value obfained across 
repelilions. Resulls are shown in Eigure|^ We firsl note lhal /c-mer based approaches, eilher generalive 
or discriminative, never oulperform Ihe alignmenl-based approach. Comparable resulls are neverlheless 
obfained for /c > 10 wifh VW, and k = 12 wifh Ihe NB. Performances obfained for shorter /c-mers are 
markedly lower fhan lhaf obfained by BWA-MEM. We note finally lhal VW generally oulperforms Ihe 
NB classifier, excepl for small fc-mers and short fragmenls (/c < 6 and L = 200). 

In summary, Ihese experimenls demonslrale Ihe relevance and feasibilily of large-scale machine 
learning for laxonomic binning: we obfain a performance comparable fo lhal of Ihe well-eslablished 
alignmenl-based approach, provided a sufficienl number of fragmenls and long enough /c-mers are con¬ 
sidered lo learn Ihe /c-mers based predictive models. 
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Figure 2: Comparison between Vowpal Wabbit and reference methods on the mini datasets. Left: 
L = 200bp fragments. Right: L = 400bp fragments. These figures show the median species-level accu¬ 
racy obtained by linear predictors trained with Vowpal Wabbit from fragments covering each reference 
genome with a mean coverage equal to 10 (purple solid line). Performances are reported as a function 
of k-m&r sizes. This approach is compared to the standard compositional Naive Bayes approach (green 
dotted line), and an alignment-based approach based on BWA (grey dotted line). 

4.2 Evaluation on the small and large reference databases 

We now proceed to a more realistic evaluation involving a larger number of candidate microbial species 
and a larger number of reference genomes, using the small and large reference databases. We learn 
classification models according to the configuration suggested by the evaluation on the mini database: 
we consider /c-mers of size 12 and a number of fragments allowing to cover each base of the reference 
genomes 10 times in average. We limit our analysis to fragments of length 200, which leads to learn 
models from around n = 1.38 x 10® and n = 2.56 x 10® fragments for the small and large reference 
databases, respectively. Note that due to the larger number of species involved, around 2®^“® = 4^^ and 
232-10 _ ^11 ejitries of the VW hash table are available per model for each of these reference databases. 
Based on the results with the mini database, this should be compatible with A:-mers of size 12. We 
evaluate the performance of the models on fragments extracted from the 193 genomes of the validation 
database and draw a number of reads necessary to cover each base of each genome once in average, 
which represents around 3.5 x 10® sequences. Results obtained by VW are again compared to that ob¬ 
tained by the two baseline approaches involved in the previous proof of concept, namely BWA-MEM 
and NB, and are shown in Table]^ We first note that for the small reference database, the performance of 
VW and BWA-MEM are very similar (median species-level accuracy of 92.4% and 93%, respectively). 
The NB classifier, on the other hand, has a significantly lower performance, with 8% less in median 
accuracy than the alignment-based approach. Considering a larger number of candidate species in the 
large reference database has little impact on the alignment-based approach, where we observe a perfor¬ 
mance drop of only 1% (91.9% vs 93%). It impacts more severely compositional approaches, with both 
NB and VW accuracies dropping by about 5%. This suggests that fc-mer based approach are still limited 
in their ability to deal with problems involving more than a few hundreds of candidate species. 
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small database 

large database 

Vowpal Wabbit 

92.4 

87.7 

Naive Bayes 

85.1 

79.8 

BWA-MEM 

93.0 

91.9 


Table 2: Performance on the small and large reference databases. This table gives the median 
species-level accuracy obtained by Vowpal Wabbit (VW), Naive Bayes (NB) and BWA-MEM using 
the small and large reference databases. Compositional approaches (VW and NB) performances are 
reported for fe-mers of size 12. VW performances are reported for a mean coverage c = 10. 


4.3 Robustness to sequencing errors 

The evaluation performed in the previous sections is based on taxonomic classification of DNA frag¬ 
ments drawn from reference genomes without errors. In real life, sequencing errors may alter the read 
sequences and make the classification problem more challenging. To evaluate the robustness of the 
classifiers to sequencing errors, we generate new reads simulating sequencing errors using the Grinder 
read simulation software ( Angly et al.\ 2012| ). We consider two types of sequencing errors models: ho¬ 
mopolymeric stretches, which are commonly encountered in pyrosequencing technologies (e.g., Roche 
454), and general mutations (substitutions and insertions/deletions). In order to be able to compare 
the results of the fragment- and read-based evaluations, we systematically simulate reads of length 200 
(exactly), and simulate around 3.5 x 10® sequences as well. 


Homopolymeric error models. 

To evaluate the impact of homopolymeric errors, we consider the three error models implemented in 
Grinder : Balzer ( Balzer et a/.[|2010| |, Richter (Richter et aZ. 2008l andMargulies ( Margulies] 
et al. 20051. Results are shown in Figure We first note that this kind of errors has a very limited 
impact on BWA-MEM: only the Margulies model turns out to be detrimental, with a drop of less 
than 1% for both the small and large reference databases. The Balzer and Richter models have a 
limited impact on the compositional approach as well: a drop of less than 1% is observed as well in most 
cases (except with the NB classifier, where a drop of almost 2% is observed using the large reference 
database and the Richter model). The Margulies model, on the other hand, has a much more se¬ 
vere impact on the performance of fc-mer based approaches. While a relatively limited performance drop 
of around 1.5% is observed with VW using the small reference database, the NB shows a drop of more 
than 5%. Considering the large reference database, both approaches show a drop of almost 8%, which 
therefore leads to a gap of more than 10% and up to 20%, for VW and NB respectively, compared to the 
performance of the alignment-based approach. This discrepancy is therefore significantly higher than 
the one observed from fragments, where VW and NB have performance lower than that of BWA-MEM 
by around 4% and 12%, respectively, on the large reference database. Analyzing the error profile of the 
reads obtained by Grinder reveals that both the Balzer and Richter models lead to a median muta¬ 
tion rate of 0.5% (meaning that half of the 200 bp simulated reads show more than one modified base), 
while this rate raises to 3% with the Margulies model. While this can readily explain why this latter 
model had a stronger impact, it suggests that what may be seen as a relatively moderate modification of 
the sequences (6 bases out of 200) can have a severe impact on compositional approaches. 


Mutation error model. 

To study the impact of general mutation errors, we consider the 4th degree polynomial proposed by 
Korbel et aZ.| (2009l and implemented in Grinder. Using the default values proposed by Grinder, we 
empirically observe a median mutation rate of 10.7%. This value is much more important than what is 
expected by current NGS technologies, and is probably due to the fact that this model was calibrated 
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Figure 3: Robustness to sequencing errors: homopolymer-based models. Top: small reference 
database. Bottom: large reference database. These figures give the median species-level accu¬ 
racy obtained with VW (purple), NB (green) and BWA (grey). Each approach has been evaluated 


on three datasets simulated according to three different error models: Balzer (Balzer et al. 20101, 
Richter ([Richter et a/.[|2008|), and Margulies ([Margulies et af.||2005|l. 


from shorter reads. Indeed, the median mutation rate decreases to around 1.5% when we reduce the 
length of the reads to 30, in agreement with the results of the original publication (Korbel et al. 20091. 
To investigate in details the impact of mutations within reads of length 200, we modify the parameters 
of the error model in order to gradually increase the median mutation rate from 1% to 10%, by 1%. This 
therefore leads to simulating 11 datasets, since we consider in addition the default Grinder configuration. 
Results are shown in Figure We first note that this type of errors has a very limited impact on 
alignment-based approach: even at the higher rate of mutation considered (median mutation rate of 
10.7%), the performance drops by around 1% with respect to the performance obtained with fragments, 
for both the small and large reference databases. On the other hand, the performance obtained with 
compositional approaches steadily decreases when the mutation rate increases. Using the small reference 
database, the impact is more severe for NB than for VW : a drop of up to 10% is observed in the former 
case (from 85.1% for fragments down to 75.2% for a mutation rate of 10.7%) and almost 6% in the latter 
(from 92.4% down to 86.7%). The drop is even more severe using the larger database in both cases. 
Interestingly while it remains relatively constant around 10% for mutation rate greater than 4% with NB 
(hence twice the gap observed between the small and large datasets using fragments), it keeps increasing 
with VW and reaches 24% at the highest mutation rate considered. As a result, VW is outperformed 
by NB using the large reference database for mutation rates greater than 8%. Although these extreme 
configurations are not realistic regarding the current state of the NGS technologies, we emphasize, in 
agreement with the previous experiment on homopolymeric errors, that significant drops are observed 
with compositional approaches for moderate mutation rates, especially for large number of candidate 
species. For instance, with a mutation rate of 2%, the performances of VW and NB drop respectively 
by 4 and 5% with the large reference database, while this has no impact on the alignment approach. 
In this more realistic setting, the alignment-based approach shows markedly higher performances: it 
provides a median species-level accuracy of 91.7%, while VW and the NB classifier reach 83.9% 74.8%, 
respectively. 
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Figure 4: Robustness to sequencing errors: mutation-based models. These figures give the median 
species-level accuracy obtained with VW (purple), NB (green) and BWA (grey), obtained with the small 
(dashed lines) and large (solid lines) reference databases. Each approach has been evaluated on 11 
datasets simulated according to different mutation rates (from 1 to 10.7%) with the error model proposed 
in |Korbel et a/.|p009| ). 
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4.4 Classification speed 


Last but not least, we now turn to the comparison of the comparative and compositional approaches in 
terms of prediction time. This aspect is indeed of critical importance for the analysis of the large vol¬ 
umes of sequence data provided by next-generation sequencing technologies, and constitutes the main 
motivation of resorting to A:-mer based approaches. To perform this evaluation we measure the time 
taken by BWA-MEM and the fe-mer based approaches to process the 30 test datasets involved in the 
previous experiments (1 fragments dataset, 3 reads datasets with homopolymeric errors and 11 reads 
datasets with mutation errors, for the two reference databases considered). This allows us to investigate 
the impact of the number of species involved in the reference database, as well as the amount of se¬ 
quencing noise in the reads. We do not make a distinction between the two compositional approaches: 
both involve computing a score for each candidate species, defined as a dot product between the /c-mer 
profile of fhe sequence fo classify and a vector of weighfs obfained by fraining fhe model. To compufe 
fhis dof-producf efficienlly, we implemenfed a procedure described in Sonnenburg et al. (20061. Wifh 
fhis procedure, each A, T, G, C nucleotide is encoded by fwo bifs, which allows to direcfly converf a 
fe-mer as in integer befween 0 and 4*^ — 1. Provided fhaf fhe weighf vector is loaded info memory, fhe 
score can be compufed “on fhe fly" while evaluafing fhe fc-mer profile of fhe sequence fo be classified, 
by adding fhe confribufion of fhe currenf /c-mer fo fhe score. The drawback of fhis procedure lies in 
fhe facf fhaf fhe vecfors of weighfs defining fhe classificafion models need fo be loaded info memory, 
which can be cumbersome in a multiclass selling. For 193 and 774 species and A:-mers of size 12, fhis 
amounfed fo 12 and 48 gigabyfes, respeclively. 

Compulation limes are measured on a single CPU (Intel XEON - 2.8 Ghz) equipped wifh 250 GB of 
memory, and summarized in Figure]^ The lime needed to classify each read or fragmenl dalasel by fhe 
fe-mer approach shows liffle variation, for a given reference dalabase. The median value obfained across 
lesl dalasels reaches 5.4 and 9.1 minutes, using fhe small and large reference dalabase, respectively, 
hence aboul a Iwo-fold difference. This Iherefore amounls to classifying around 6.5 x 10^ and 3.9 x 10^ 
reads per minule, respeclively. BWA-MFM shows a differenl behavior. We observe fhaf fhe lime varies 
more across reads and fragmenl dalasels, and lends fo increase wifh fhe amounf of sequencing noise. 
On fhe ofher hand, fhe size of fhe reference dalabase has a lesser impacf, wifh af mosl an increase of 
20% befween fhe lime needed fo process a fesl dalasel wifh fhe small or large reference dalabases. The 
compositional approach syslemafically offers shorler predicfion limes, wifh an improvemenl of 3 fo 
almosl 15 limes, depending on fhe configuralion. 


5 Discussion 

In fhis work, we invesligale fhe polenfial of modern, large-scale machine learning approaches for lax- 
onomic binning of melagenomics dafa. We exlensively evaluate Iheir performance when fhe scale of 
fhe problem increases regarding (i) fhe lengfh of fhe fe-mers considered fo represenl a sequence, (ii) fhe 
number of fragmenls used to learn fhe model, and (iii) fhe number of candidale species involved in fhe 
reference dalabase. We also invesligale in delails Iheir robusfness to sequencing errors using simulaled 
reads. We consider fwo baselines for fhis evalualion: a comparalive approach based on fhe BWA-MEM 
sequence aligner and a composifional approach based on fhe generalive NB classifier. We demonsfrafe 
in parlicular fhaf increasing fhe number of fragmenls used fo frain fhe model has a significanl impacf on 
fhe accuracy of fhe model, and allows to esfimale models based on longer /c-mers. While fhis could be 
expected and was already highlighted by previous sludies, fhe resulling configuralions are ouf of reach 
of slandard SVM implemenlafions. We also show fhaf discriminalively frained composifional models 
usually offer significanfly higher performances lhan generalive NB classifiers. The resulling models are 
compefilive wifh well-eslablished alignmenf fools for problems involving a small fo moderale number 
of candidale species, and for reasonable amounls of sequencing errors. Our resulls suggesl, however, 
fhaf composifional approaches, bolh discriminative and generalive, are sfill limifed in Iheir abilify fo 
deal wifh problems involving more lhan a few hundreds species. In fhis case, indeed, composifional 
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computation times - smail reference database 



Figure 5: Classification times. The bars represent the time (in minutes) required to elassify eaeh frag¬ 
ment or read dataset using BWA-MEM, using the small (top) or large (bottom) referenee database. The 
dashed horizontal lines represent the median time required by VW. The figures shown on top of the bars 
represent the ratio between the times taken by BWA-MEM and VW. 
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approaches exhibit lower performance than alignment-based approaches and are much more negatively 
impacted by sequencing errors. Finally, we confirm that compositional approach achieve faster pre¬ 
diction times. This is indeed systematically the case in the various configurations listed above, with 
predictions obtained 3 to 15 times faster by compositional approaches, and, interestingly, depends on 
the number of candidate species and the level of sequencing noise. We emphasize, however, that fast 
predictions can only be obtained provided that the classification models are loaded in memory, hence 
for a memory footprint that scales linearly with the number of candidate species and exponentially with 
the size of the fc-mers, which can become important for large reference databases and long fe-mers. 

At least three simple extensions could be envisioned to make compositional approaches more com¬ 
petitive in accuracy with the alignment-based approach, faster, and to limit their memory footprint. First, 
the robustness to sequencing errors may be improved by learning models from simulated reads instead of 
fragments. This could indeed allow to tune the model to the sequencing technology producing the reads 
to be analyzed, provided its error model is properly known and characterized. Second, introducing a 
sparsity-inducing penalty while learning the model would have the effect of reducing the number of fea¬ 
tures entering the model, hence to reduce the memory footprint required to load the model into memory. 
Finally, alternative strategies, known as error correcting tournaments (Beygelzimer et al. 20091, could 
be straightforwardly considered to reduce the number of models to learn, hence to store into memory 
during prediction, to address a multiclass problem. Our results indeed suggest that addressing these 
issues is critical to build state-of-the-art compositional classifiers fo analyze mefagenomics samples fhaf 
may involve a broad specfrum of species. We emphasize however fhaf such large scale models can re¬ 
main compefifive for realistic amounfs of sequencing errors and a moderafe number of species (around 
200 in our sfudy), hence can already be useful in cases where fhe number of species fhaf can be encoun¬ 
tered is limited, which may in particular be fhe case for diagnosfic applications involving specific fypes 
of specimens. 
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